JSC 370: Data Science II

Week 1: Introduction

2026-01-05

Course Details

Lectures: Mondays 1–3pm, MS 3278
Labs: Wednesdays 1–3pm, HS 108

  • Meredith Franklin: meredith.franklin@utoronto.ca
  • TAs: Johnny Meng and Mandy Yao

My Background

  • In late 2021 I moved from Los Angeles, where I was an Assistant/Associate Professor of Biostatistics at the University of Southern California
  • From Canada: McGill (BSc), Ottawa/Carleton Institute of Math (MSc), Harvard (PhD), UChicago (postdoc)
  • Here I’m an Associate Professor with tenure in the Department of Statistical Science (51%) and the School of the Environment (49%)
  • Data Science Concentration Lead for the MScAC
  • Executive committee for the U of T Data Science Institute

My Teaching

  • Founded a Master’s of Health Data Science program at USC that launched in 2020
  • Co-taught the introductory data science course
  • Taught graduate-level spatial statistics, inference, linear models
  • At U of T I teach STA465/STA2016/ENV1112 (Spatial Data Analysis) and I have also taught STA255 (Statistical Theory) and ENV1197 (Research Methods)

My Research

  • Spatial statistical methods for environmental data
  • Environmental epidemiology
  • Data science techniques for remote sensing data/imagery
  • Focus on pollution (air, noise) and climate (GHG, land cover change)
  • Machine learning/Deep learning for spatiotemporal data

Course Goals

Through this course, you will hone techniques used in data science. You will learn:

  • Programming in Python, plus tools such as Markdown/Quarto and Git
  • Exploratory data analysis — generating hypotheses and building intuition
  • Data visualization — interpretable summaries
  • Data collection — APIs, data scraping, wrangling, cleaning
  • Statistical and machine learning algorithms
  • Building an interactive github.io website to communicate your work

Course Platforms

  • Course website: weekly breakdown, lecture slides, labs, datasets
    https://jsc370.github.io/JSC370-2026/

  • Quercus: announcements, Piazza discussion, grades, course logistics

What is data science?

  • Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.

Source: https://r4ds.hadley.nz/

What do data scientists actually do?

  • Frame a question and define success (what would convince you?)
  • Acquire data (files, databases, APIs, scraping)
  • Clean + validate (units, missingness, duplicates, joins)
  • Explore (EDA) → iterate → refine hypotheses
  • Model + evaluate (prediction, inference, uncertainty)
  • Communicate results (reports, dashboards, reproducible code)
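The "clean + validate" step can be made concrete with a small sketch. The records, column names, and the "no negative values" rule below are invented for illustration, and the example uses only the standard library so it runs anywhere; in practice you would likely reach for pandas.

```python
import csv
import io

# Hypothetical raw data: one duplicate row, one missing value, one implausible value.
RAW = """station,date,pm25
A,2026-01-01,12.5
A,2026-01-01,12.5
B,2026-01-01,
C,2026-01-01,-4.0
D,2026-01-01,8.1
"""

def clean(raw_text):
    """Return (clean_rows, issues) after basic validation checks."""
    seen = set()
    clean_rows, issues = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        key = (row["station"], row["date"])
        if key in seen:                      # duplicate station/date pair
            issues.append((key, "duplicate"))
            continue
        seen.add(key)
        if row["pm25"] == "":                # missing measurement
            issues.append((key, "missing"))
            continue
        value = float(row["pm25"])
        if value < 0:                        # a concentration cannot be negative
            issues.append((key, "out of range"))
            continue
        clean_rows.append({"station": row["station"], "date": row["date"], "pm25": value})
    return clean_rows, issues

rows, issues = clean(RAW)
print(f"{len(rows)} clean rows, {len(issues)} flagged")
for key, reason in issues:
    print(key, reason)
```

Flagging problems instead of silently dropping rows keeps a record of what was removed and why, which matters when you later report limitations.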

Data scientists have a great responsibility to communicate effectively

Source: https://xkcd.com/605/

Why communication matters in Data Science

  • Opaque models without uncertainty estimation or discussion of limitations
  • Biased or unrepresentative data may lead to biased predictions
  • Correlation \(\ne\) causation: confounding can make patterns found in EDA misleading
  • Overfitting and weak validation
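A small simulation can illustrate how confounding misleads. This is a sketch under invented assumptions: `z` is a hypothetical confounder, `x` and `y` each depend on `z` but not on each other, and the noise levels are arbitrary. The "adjustment" simply subtracts `z`, which is only possible here because the simulation defines its effect.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(370)  # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]       # confounder
x = [zi + rng.gauss(0, 0.5) for zi in z]      # x depends on z only
y = [zi + rng.gauss(0, 0.5) for zi in z]      # y depends on z only

raw_r = pearson(x, y)
# Remove the confounder's contribution (known exactly in this simulation).
adj_r = pearson([xi - zi for xi, zi in zip(x, z)],
                [yi - zi for yi, zi in zip(y, z)])
print(f"raw correlation: {raw_r:.2f}")
print(f"after adjusting for z: {adj_r:.2f}")
```

With these settings the theoretical raw correlation is 1 / 1.25 = 0.8 even though `x` has no effect on `y`; once `z` is accounted for, the association should be close to zero.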

Top Job Titles

Data Scientists in Demand

  • Demand for data science skills is high across many sectors.
  • Skills show up again and again:
    • Python + libraries (pandas, numpy, sklearn)
    • SQL + relational thinking (joins, grouping)
    • Visualization + storytelling (matplotlib/plotnine/plotly)
    • Reproducibility (git, environments, reports)
    • Communication (clear write-ups, assumptions, limitations)
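As a rough illustration of "relational thinking", the sketch below joins two tables and groups the result using Python's built-in `sqlite3` module. The table names, columns, and values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE stations (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE readings (station_id INTEGER, pm25 REAL);
    INSERT INTO stations VALUES (1, 'Toronto'), (2, 'Toronto'), (3, 'Ottawa');
    INSERT INTO readings VALUES (1, 10.0), (2, 14.0), (3, 6.0), (3, 8.0);
""")

# Join each reading to its station, then aggregate per city.
query = """
    SELECT s.city, COUNT(*) AS n, AVG(r.pm25) AS mean_pm25
    FROM readings AS r
    JOIN stations AS s ON s.id = r.station_id
    GROUP BY s.city
    ORDER BY s.city
"""
result = conn.execute(query).fetchall()
conn.close()

for city, n, mean_pm25 in result:
    print(city, n, mean_pm25)
```

The same join-then-group pattern carries over to pandas (`merge` followed by `groupby`), so the habit of thinking in keys and aggregations transfers between tools.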

Global labor market outlook

World Economic Forum (Future of Jobs Report 2025) projects that by 2030:

  • 170 million roles created (across the labor market)
  • 92 million roles displaced
  • Net +78 million jobs (~7% net growth)

These totals reflect all occupations (not just AI). Data/AI roles show up when we look at the fastest-growing job families. Source: World Economic Forum

Which jobs are growing fastest?

Fastest-growing roles (percentage terms) include:

  • Big Data Specialists
  • AI and Machine Learning Specialists
  • FinTech Engineers
  • Software and Application Developers

The data behind the plot

Quantity          Value (millions)
Jobs created      170
Jobs displaced    92
Net change        78

Visualization of these data

import matplotlib.pyplot as plt

# WEF Future of Jobs Report 2025 projections for 2030, in millions of jobs
labels = ["Jobs created", "Jobs displaced", "Net change"]
created, displaced = 170, 92
vals = [created, displaced, created - displaced]  # net change = 78

plt.figure()
plt.bar(labels, vals)
plt.ylabel("Millions of jobs")
plt.title("Projected global job changes by 2030 (WEF)")
plt.tight_layout()
plt.show()

What is this course?

This course is an introduction to the world of data science following on from where JSC270 left off.

We will focus on transferable skills and modern workflows:

  • Python + VS Code for computing
  • Quarto for reproducible reports/slides
  • Git/GitHub for version control and collaboration

What you should expect

  • Weekly labs and bi-weekly homework using Python + Quarto
  • Submissions via GitHub Classroom (version control matters)
  • Focus on reproducibility: clean repos, clear reports, runnable code
  • Collaboration encouraged for discussion, but write-ups/code must be your own

Data Science Resources: Python

Python and VS Code

  • Python: general-purpose language widely used for data science
  • VS Code: a lightweight, extensible editor with strong Python + notebook support

What is VS Code?

  • An editor for code, notebooks, and markdown
  • Works well with:
    • Python environments (conda/venv)
    • Jupyter notebooks
    • Connecting to remote environments (e.g., Digital Research Alliance of Canada, AWS)
    • Git/GitHub
    • Quarto render/preview

VS Code: Things you’ll use every week

  • Select the correct Python environment
  • Integrated terminal (run quarto render, git, ssh)
  • Source control panel (stage/commit/push)
  • Quarto preview for slides/reports
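One quick way to confirm that VS Code picked the environment you intended is to ask Python which interpreter is running. This is a minimal sketch; the helper name `describe_interpreter` is made up for illustration.

```python
import sys

def describe_interpreter():
    """Summarize which Python installation is running this code."""
    return {
        "executable": sys.executable,          # full path to the interpreter
        "version": sys.version.split()[0],     # version string only
        # True inside a venv; conda environments are not detected this way.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }

info = describe_interpreter()
for name, value in info.items():
    print(f"{name}: {value}")
```

If the printed path does not point into the environment you expected, re-select the interpreter in VS Code before installing packages or rendering.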

VS Code: Layout

Quarto and Python and VS Code

  • Quarto: markdown-based publishing for reports, sites, and slides

How does Quarto work?

  • You write a single source file (.qmd) that mixes text, code, and output options (YAML + chunk options).
  • Quarto executes code (e.g., Python via Jupyter) and captures tables, figures, and printed output.
  • It renders the document through Pandoc into a chosen format (HTML report, reveal.js slides, PDF, Word, etc.).
  • The final output is a shareable deliverable (e.g., report.html) that can be published; for HTML, setting embed-resources: true produces a single self-contained file.

Generating reports with Quarto

  • A Quarto document is a plain-text file with extension .qmd.
  • At the top is a YAML header that controls metadata and output (title, author, date, format, options).
  • The format field determines what Quarto produces:
    • html → a web report (.html)
    • pdf → a PDF report (requires a LaTeX installation)
    • docx → a Word document
    • revealjs → slides

An example YAML header for an HTML report:

---
title: "My Report"
format: html
---

  • Rendering turns MyReport.qmd into MyReport.html (and supporting files if needed).
  • Core command: quarto render MyReport.qmd
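To try the write-then-render cycle end to end, the sketch below writes a minimal `MyReport.qmd` and calls `quarto render` on it. It is an illustration only: the helper `write_and_render` and the one-line report body are made up, and rendering is skipped when the `quarto` command is not found on your PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

QMD_SOURCE = """---
title: "My Report"
format: html
---

This report was generated from a `.qmd` file.
"""

def write_and_render(directory):
    """Write MyReport.qmd into `directory` and render it if Quarto is installed."""
    qmd_path = Path(directory) / "MyReport.qmd"
    qmd_path.write_text(QMD_SOURCE, encoding="utf-8")
    command = ["quarto", "render", qmd_path.name]
    rendered = False
    if shutil.which("quarto"):  # skip rendering when the CLI is absent
        done = subprocess.run(command, cwd=directory, capture_output=True, text=True)
        rendered = done.returncode == 0
    return qmd_path, command, rendered

with tempfile.TemporaryDirectory() as tmp:
    path, command, rendered = write_and_render(tmp)
    qmd_text = path.read_text(encoding="utf-8")
    print(" ".join(command), "-> rendered:", rendered)
```

In day-to-day work you would run the command directly in the VS Code terminal; wrapping it in Python is only useful when rendering is one step of a larger script.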

Quarto code chunks

A code chunk might look like this:

import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])
plt.show()

Preamble (options):

  • echo: show/hide code
  • eval: run or skip
  • fig-width / fig-height: control plot size

Quarto code chunks and options

Code:

  • import: load libraries
  • plt.plot, plt.show: make figures and display them
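To see how the options and the code sit together inside a `.qmd` file, the sketch below assembles the text of one chunk: a fence opened with `{python}`, option lines prefixed with `#|`, then the code. The helper `python_chunk` is hypothetical (it is not part of Quarto), and the option values are arbitrary.

```python
FENCE = "`" * 3  # three backticks, built here to avoid nesting code fences

def python_chunk(code, **options):
    """Format `code` as a Quarto Python chunk with `#|` option lines."""
    lines = [FENCE + "{python}"]
    for name, value in options.items():
        key = name.replace("_", "-")        # fig_width -> fig-width
        if isinstance(value, bool):
            value = str(value).lower()      # YAML booleans are true/false
        lines.append(f"#| {key}: {value}")
    lines.append(code.strip())
    lines.append(FENCE)
    return "\n".join(lines)

plot_code = """
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])
plt.show()
"""

chunk = python_chunk(plot_code, echo=False, eval=True, fig_width=6, fig_height=4)
print(chunk)
```

Pasting the printed text into a `.qmd` file gives a chunk that runs the plot, hides the code in the output, and sets the figure size.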

GitHub

  • Version control is essential in industry and academia
  • Building a GitHub portfolio supports job hunting
  • You will build a github.io website as part of this course, with interactivity and app-type features

Git vs GitHub

  • Git: version control tool on your computer
  • GitHub: hosting + collaboration (remote repos, PRs, issues)
  • Workflow: edit → commit (git) → push (to GitHub)
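The edit → commit part of this workflow can be tried safely in a throwaway repository. The sketch below drives the `git` command line from Python; the file name, commit message, and identity are placeholders, and the push step is only described in a comment because no remote is configured.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def git(args, cwd):
    """Run one git command in `cwd` and return its standard output."""
    done = subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout

def demo_commit():
    """Edit, stage, and commit in a temporary repo; return the log lines."""
    if shutil.which("git") is None:
        return None  # git is not installed
    with tempfile.TemporaryDirectory() as repo:
        git(["init"], repo)
        Path(repo, "notes.md").write_text("Week 1 notes\n", encoding="utf-8")  # edit
        git(["add", "notes.md"], repo)                                         # stage
        # Placeholder identity so the commit works without global git config.
        identity = ["-c", "user.name=Student",
                    "-c", "user.email=student@example.com",
                    "-c", "commit.gpgsign=false"]
        git([*identity, "commit", "-m", "Add week 1 notes"], repo)             # commit
        # With a GitHub remote configured, `git push` would publish this commit.
        return git(["log", "--oneline"], repo).splitlines()

log_lines = demo_commit()
print(log_lines)
```

In the course you will normally run these commands in the terminal or through the VS Code source control panel; the point here is only the order of the steps.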

This week: checklist

  • Install Python + VS Code + Quarto
  • Clone your GitHub Classroom repo
  • Render a .qmd locally
  • Commit + push your changes
  • This week's lab is self-directed; in-person labs start next week, on Wednesday, January 14!

Next Week

  • Lecture: Monday January 12, 1–3pm (Version control)
  • Lab: Wednesday January 14, 1–3pm (VS Code, Python, Quarto, GitHub)